5 Raw Data Processing

Before analysing lagged correlations with lagci, you will usually need to process a large amount of raw data. This tutorial walks through that processing step by step, until the data are in the correct format.

5.1 Get started with example data

Running the chunk below produces data in the format that the lagci package needs. The data frame should have at least two columns: a time column of class POSIXct and a value column.

example_data_1 <- data.frame(
  datetime = seq(from = as.POSIXct("2003-01-09"),to = as.POSIXct("2004-01-09"),by = "day")
)

example_data_1$value <- sin(seq(from = 0, to = 2*pi, length.out = nrow(example_data_1)))

head(example_data_1)

    datetime      value
1 2003-01-09 0.00000000
2 2003-01-10 0.01721336
3 2003-01-11 0.03442161
4 2003-01-12 0.05161967
5 2003-01-13 0.06880243
6 2003-01-14 0.08596480

class(example_data_1$datetime)

[1] "POSIXct" "POSIXt"
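The two requirements above (a POSIXct time column and a numeric value column) can be checked before any analysis. The following is a minimal sketch, not part of lagci: `check_time_value()` is a hypothetical helper written for this tutorial, and the column names are arguments because the examples in this chapter use both `datetime` and `time`.

```r
# Hypothetical helper (not a lagci function): verify that a data frame has
# a POSIXct time column and a numeric value column.
check_time_value <- function(df, time_col = "time", value_col = "value") {
  stopifnot(is.data.frame(df))
  missing_cols <- setdiff(c(time_col, value_col), names(df))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (!inherits(df[[time_col]], "POSIXct")) {
    stop("Column '", time_col, "' is not POSIXct")
  }
  if (!is.numeric(df[[value_col]])) {
    stop("Column '", value_col, "' is not numeric")
  }
  invisible(TRUE)
}

# Self-contained demonstration data with the same layout as example_data_1
demo <- data.frame(
  datetime = seq(as.POSIXct("2003-01-09", tz = "UTC"),
                 as.POSIXct("2003-01-18", tz = "UTC"), by = "day")
)
demo$value <- sin(seq(0, 2 * pi, length.out = nrow(demo)))

check_time_value(demo, time_col = "datetime")  # passes silently
```

If a column is missing or has the wrong class, the helper stops with a message naming the offending column.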

5.1.1 Type one: date split across columns

Your raw data may be structured like this, with the date split across separate year and month columns:

example_data_2 <- data.frame(
  year = rep(x = 2000:2025, each = 12),
  month = rep(x = 1:12, times = 26),
  value = sin(seq(from = 0, to = 2*pi, length.out = 12*26))
)

head(example_data_2)

  year month      value
1 2000     1 0.00000000
2 2000     2 0.02020179
3 2000     3 0.04039534
4 2000     4 0.06057240
5 2000     5 0.08072474
6 2000     6 0.10084413

Use the following script to convert such data to the correct format.

The code builds a two-column data frame. The time column is first created as a character string by concatenating year, month, and the fixed day "01"; the value column is copied from the original data. The script then shows that time is initially of class "character", converts it to a real timestamp with as.POSIXct() (you can add tz = "UTC" for reproducibility), prints the first few rows to verify the result, and finally confirms that time is now of class c("POSIXct", "POSIXt").

In short, it turns separate year and month information into a proper POSIXct time column plus a numeric value column, which is the format expected by downstream lagged-correlation tools. If you want stricter parsing, build zero-padded months with sprintf("%04d-%02d-01", year, month) before the conversion.

example_data_2_correct <- data.frame(
  time = paste0(example_data_2$year,"-",example_data_2$month,"-01"),
  value = example_data_2$value
)

class(example_data_2_correct$time)

[1] "character"

example_data_2_correct$time <- as.POSIXct(example_data_2_correct$time)

head(example_data_2_correct)

        time      value
1 2000-01-01 0.00000000
2 2000-02-01 0.02020179
3 2000-03-01 0.04039534
4 2000-04-01 0.06057240
5 2000-05-01 0.08072474
6 2000-06-01 0.10084413

class(example_data_2_correct$time)

[1] "POSIXct" "POSIXt"
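The stricter variant mentioned above (zero-padded months and a fixed time zone) could look like the sketch below. It uses a small stand-in data frame, `monthly`, rather than example_data_2, so it runs on its own; the explicit `format` argument and the final check are additions for illustration.

```r
# Toy input with the same layout as example_data_2 (two years, monthly)
monthly <- data.frame(
  year  = rep(2000:2001, each = 12),
  month = rep(1:12, times = 2),
  value = sin(seq(0, 2 * pi, length.out = 24))
)

monthly_correct <- data.frame(
  # Zero-pad the month so every string has the form "YYYY-MM-01"
  time = as.POSIXct(
    sprintf("%04d-%02d-01", monthly$year, monthly$month),
    format = "%Y-%m-%d",
    tz = "UTC"  # fixed time zone so results do not depend on the machine
  ),
  value = monthly$value
)

# Stop early if any row failed to parse (unparsed strings become NA)
stopifnot(!anyNA(monthly_correct$time))

head(monthly_correct, 3)
```

Fixing the time zone avoids timestamps shifting when the same script runs on machines with different local settings.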

5.1.2 Type two: omics data format

When you work with omics data, you may receive several files:

This chunk creates a toy omics-style wide table. It generates 5 human-readable feature IDs with ids::adjective_animal(), builds a half-hourly POSIXct time vector from 2019-04-29 03:30 to 2019-05-06 21:30, and simulates 5 time series from a noisy sine curve.

The replicate() output is transposed with t() so that rows are features and columns are time points. The column names are set to the timestamps, and head(omics_data[1]) previews the first time column.

For robustness and reproducibility, call set.seed(1) before replicate(), set the column names from explicitly formatted text such as format(time_index, "%Y-%m-%d %H:%M:%S"), and make sure that nrow(omics_data) == nrow(IDs) so that the IDs match the features.

If the ids package is not installed, fall back to data.frame(ids = paste0("id_", 1:5)). Later, convert this wide table to a tidy long format (id/time/value) before lagged-correlation analysis.

IDs <- data.frame(
  ids = ids::adjective_animal(n = 5)
    )

head(IDs)

                         ids
1 prepolitical_homalocephale
2 preevolutional_belugawhale
3       acculturative_cuscus
4         zoographical_leech
5    deistical_englishsetter

library(magrittr)

time_index <- seq.POSIXt(as.POSIXct("2019-04-29 03:30"),
                         as.POSIXct("2019-05-06 21:30"),
                         by = "30 min")

omics_data <- replicate(5, 
                  sin(seq(0, 10, length.out = length(time_index))) + rnorm(length(time_index), 0, 0.2)) %>%
        t() %>% as.data.frame()
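# format() keeps the full "YYYY-MM-DD HH:MM:SS" text in every name, including midnight values
colnames(omics_data) <- format(time_index, "%Y-%m-%d %H:%M:%S")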

head(omics_data[1])

  2019-04-29 03:30:00
1         0.004541227
2        -0.113343339
3        -0.182325960
4         0.247461356
5        -0.133177177
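The robustness suggestions above can be combined into one reproducible chunk. This is a sketch under a few assumptions: the names `feature_ids`, `time_index_utc`, and `omics_wide` are illustrative only, the time zone is fixed to UTC, and the simulated values will differ from the output printed above because that output was generated without a seed.

```r
set.seed(1)  # make the simulated noise reproducible

n_features <- 5

# Use ids::adjective_animal() when available; otherwise fall back to plain IDs
feature_ids <- if (requireNamespace("ids", quietly = TRUE)) {
  ids::adjective_animal(n = n_features)
} else {
  paste0("id_", seq_len(n_features))
}

time_index_utc <- seq(as.POSIXct("2019-04-29 03:30", tz = "UTC"),
                      as.POSIXct("2019-05-06 21:30", tz = "UTC"),
                      by = "30 min")

# replicate() returns a time-by-feature matrix; transpose to feature-by-time
sim <- replicate(
  n_features,
  sin(seq(0, 10, length.out = length(time_index_utc))) +
    rnorm(length(time_index_utc), mean = 0, sd = 0.2)
)
omics_wide <- as.data.frame(t(sim))

# format() writes every timestamp as "YYYY-MM-DD HH:MM:SS"
colnames(omics_wide) <- format(time_index_utc, "%Y-%m-%d %H:%M:%S")

# One row per feature, so the IDs line up with the rows
stopifnot(nrow(omics_wide) == length(feature_ids))
dim(omics_wide)
```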

Transform the wide table into five data frames, one per feature, each in the correct format:

This code converts the omics wide table into per-feature tidy frames. First, omics_data is transposed so rows = time, columns = features (full_data <- omics_data %>% t() %>% as.data.frame()), then column names are set from IDs$ids.

The loop builds df_list: for each feature, it creates a two-column data frame with time parsed from the row names (timestamps) and the corresponding value; head(df_list[[1]]) previews the first feature’s time–value series.

full_data <- omics_data %>% 
  t() %>% 
  as.data.frame()

colnames(full_data) <- IDs$ids

head(full_data)

                    prepolitical_homalocephale preevolutional_belugawhale
2019-04-29 03:30:00                0.004541227               -0.113343339
2019-04-29 04:00:00                0.070978772                0.094855738
2019-04-29 04:30:00                0.077887042                0.054871044
2019-04-29 05:00:00                0.293393153               -0.087391267
2019-04-29 05:30:00                0.185966587                0.007135642
2019-04-29 06:00:00               -0.069981955                0.038041818
                    acculturative_cuscus zoographical_leech
2019-04-29 03:30:00           -0.1823260         0.24746136
2019-04-29 04:00:00            0.1956677         0.08338463
2019-04-29 04:30:00           -0.2860073         0.02519943
2019-04-29 05:00:00            0.2339092        -0.16757358
2019-04-29 05:30:00            0.1279799         0.11960344
2019-04-29 06:00:00           -0.2741057        -0.13719129
                    deistical_englishsetter
2019-04-29 03:30:00            -0.133177177
2019-04-29 04:00:00             0.258469889
2019-04-29 04:30:00            -0.009899799
2019-04-29 05:00:00            -0.148279988
2019-04-29 05:30:00            -0.183354948
2019-04-29 06:00:00             0.083825397

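df_list <- list()

# Build one time/value data frame per feature (column of full_data)
for (i in seq_len(ncol(full_data))) {
  varname <- colnames(full_data)[i]
  df_list[[varname]] <- data.frame(
    # Row names hold the timestamps as text; parse them back to POSIXct
    time = as.POSIXct(rownames(full_data), format = "%Y-%m-%d %H:%M:%S"),
    value = full_data[[i]]
  )
}
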
head(df_list[[1]])

                 time        value
1 2019-04-29 03:30:00  0.004541227
2 2019-04-29 04:00:00  0.070978772
3 2019-04-29 04:30:00  0.077887042
4 2019-04-29 05:00:00  0.293393153
5 2019-04-29 05:30:00  0.185966587
6 2019-04-29 06:00:00 -0.069981955
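A list of per-feature data frames like df_list can also be stacked into the tidy long format (id/time/value) mentioned earlier, or written out as one file per feature. The sketch below uses a small stand-in list, `toy_list`, so that it runs on its own; the feature names, the output directory, and the CSV file naming are assumptions for illustration.

```r
# Small stand-in for df_list: a named list of time/value data frames
time_points <- seq(as.POSIXct("2019-04-29 03:30", tz = "UTC"),
                   by = "30 min", length.out = 4)
toy_list <- list(
  feature_a = data.frame(time = time_points, value = c(0.1, 0.2, 0.3, 0.4)),
  feature_b = data.frame(time = time_points, value = c(1.1, 1.2, 1.3, 1.4))
)

# Add an id column to each frame, then stack the frames row-wise
long_data <- do.call(
  rbind,
  lapply(names(toy_list), function(id) cbind(id = id, toy_list[[id]]))
)
rownames(long_data) <- NULL

str(long_data)

# Optionally write one CSV per feature (here into a temporary directory)
out_dir <- tempdir()
for (id in names(toy_list)) {
  out <- toy_list[[id]]
  # Store timestamps as explicitly formatted text so the file is unambiguous
  out$time <- format(out$time, "%Y-%m-%d %H:%M:%S")
  write.csv(out, file.path(out_dir, paste0(id, ".csv")), row.names = FALSE)
}
```

The long table keeps every feature in one object, which is convenient for filtering and plotting, while the per-feature files match the one-series-per-table layout used throughout this chapter.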

5.2 Session information

sessionInfo()

R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.0

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Singapore
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 ids_1.0.1         compiler_4.5.1    fastmap_1.2.0    
 [5] cli_3.6.5         tools_4.5.1       htmltools_0.5.8.1 rstudioapi_0.17.1
 [9] uuid_1.2-1        rmarkdown_2.29    knitr_1.50        jsonlite_2.0.0   
[13] xfun_0.53         digest_0.6.37     rlang_1.1.6       evaluate_1.0.4